ABSTRACT

In addition to the frequency of terms in a document collection, the distribution of terms plays an important role in determining the relevance of documents for a given search query. In this paper, term distribution analysis using Fourier series expansion as a novel approach for calculating an abstract representation of term positions in a document corpus is introduced. Based on this approach, two methods for improving the evaluation of document relevance are proposed: (a) a function-based ranking optimization representing a user defined document region, and (b) a query expansion technique based on overlapping the term distributions in the top-ranked documents. Experimental results demonstrate the effectiveness of the proposed approach in providing new possibilities for optimizing the retrieval process.

1. INTRODUCTION

One way to address the problems of synonymy and polysemy associated with lexical matching methods [s] in text retrieval applications is to consider contextual information [18]. In fact, several search engines make use of contextual information to disambiguate query terms [23]. Contextual information is either obtained from the user, from the document structure or from the text itself by performing some form of statistical analysis, such as counting the frequency and/or distance of terms.

In this paper, a text retrieval approach that incorporates novel contextual analysis and document ranking methods is presented. The proposed approach, called Fourier Vector Scoring, is based on an abstract description of the term positions in a document, represented by the Fourier series expansion of a rectangular function describing the term positions in the document. In addition, a document ranking optimization procedure, based on objective query functions determining a user defined document region, is proposed as an alternative to the well-known term frequency metrics. Furthermore, a query expansion algorithm is introduced. It is based on overlapping the distributions of query terms in the top-ranked documents. Experimental results obtained for the TREC-8 document collection demonstrate that the proposed approach is superior to state-of-the-art relevance feedback techniques such as Rocchio and Divergence from Randomness models [31], [1].

The paper is organized as follows: Section 2 outlines related work on contextual information retrieval. In Section 3, term distribution analysis using Fourier series expansion is presented. The comparison of term distributions is described in Section 4. Section 5 discusses experimental results. Section 6 concludes the paper and outlines areas for future research.


2. RELATED WORK 

Contextual information can be obtained in two ways: by the text surrounding the search terms in the document corpus, or by the context delivered by the user (i.e. personalization) [18]. There are approaches that utilize the query history of users [33] or the text surrounding the query [15], [29] to build augmented queries (i.e. query expansion) for improving the performance of interactive retrieval systems.

Relevance feedback is the most popular query expansion strategy [32], [t]. Here, the expansion terms are typically extracted from the documents that were retrieved and judged as relevant in a previous retrieval iteration. As demonstrated in several experimental studies, relevance feedback systems are quite effective [30], [s]. However, the browsing process required to determine the relevance of a document has been widely recognized as a significant limitation by the information retrieval research community.



To overcome the intervention of the user in the relevance feedback process, two basic types of strategies have been proposed: automatic global analysis and automatic local analysis. In automatic global analysis, all documents of the collection are used to determine a thesaurus-like structure, defining term-to-term relationships within the document corpus. In general, global analysis techniques are limited to small database applications, where doubtful improvements have been observed [2]. In automatic local analysis, the system is able to estimate the relevance of the first retrieved documents without user intervention. The main idea is to consider the top-n initially retrieved documents as relevant, and to use statistical heuristics to identify query related terms [13], [37]. Noise and multiple topics are two major negative factors for expansion term selection [39]. To deal with these problems, traditional clustering methods have been proposed [17]. The experiments performed by Fan et al. [14] confirm that highly-tuned ranking offers more high-quality documents at the top of the hit list.

Typically, it is difficult to determine correlated terms inside a document, because these terms do not necessarily co-occur very frequently with the original query terms if the document is considered as a whole. In fact, it is common to have unrelated terms co-occurring with query terms very frequently [35]. To address this problem, page segmentation strategies have been suggested [39], [O]. They provide a better document partitioning at the semantic level and reduce the probability of carrying irrelevant terms into the query expansion process. In general, an important drawback of automatic local analysis strategies is the considerable amount of computation, which represents a substantial problem for interactive systems [24].

Katz [21] has analyzed the distribution of content-bearing terms in technical documents. Important concepts supporting word occurrence models, such as inter-/within-document relationships, topicality and burstiness are proposed. The author concentrates on the modeling of the inter-document distributions of content words, while our work focuses on the within-document relationships applied to relevance evaluation in the information retrieval process.



An approach to apply term positional data in retrieval feedback is the work of Attar and Fraenkel [5]. They propose different models to generate clusters of terms related to a query (searchonyms) and use these clusters in a local feedback process. In their experiments with English and Hebrew documents, they confirm that metrical methods based on functions of the distance between terms are superior to methods based merely on weighted co-occurrences of terms.

Several approaches based on Hidden Markov Models have been proposed [25], [26], [38]. For example, Miller et al. [25] propose a probabilistic model based on Bayes' theorem. Although this approach is mathematically elegant, its probabilistic hypothesis finally reduces it to a term frequency representation.

One of the first approaches applying Fourier analysis to term distributions in documents is Fourier Domain Scoring (FDS), proposed by Park et al. [22]. FDS performs a separate magnitude and phase analysis of term position signals to produce an optimized ranking. It creates an index based on page segmentation, storing term frequency and approximated positions in the document. FDS processes the indexed data using the Discrete Fourier Transform to perform the corresponding spectral analysis. Our approach, on the other hand, represents the term signal information (Fourier coefficients) directly as an n-dimensional vector using the analytic Fourier transform, thus permitting an immediate and simple term comparison process.



3. TERM DISTRIBUTION ANALYSIS USING FOURIER SERIES

Fourier analysis is based on the idea that functions can be approximated by a sum of sine and cosine waves at different frequencies. The more sinusoids are included in the sum, the better the approximation. There are several applications of Fourier analysis in the field of information retrieval (IR), such as audio-IR flo], image-IR [16], and text-IR [22].

Consider a function $f(x)$ that is defined for $x \in [0, L]$. A Fourier series expansion is an expansion of $f(x)$ into a constant term and the functions $\cos(2\pi k x / L)$ and $\sin(2\pi k x / L)$, $k = 1, 2, \ldots$, where the coefficients $a_k$ and $b_k$ have to be determined. If the sum over $k$ is restricted to $k \le n$, the Fourier series gives an approximation $f_n(x)$ to the function $f(x)$ called the n-th order Fourier approximation of $f(x)$.

Consider a document D containing L terms. To characterize the distribution of a particular term t within the document, the set of positions of all occurrences of t in D is denoted as $V_t$, where all terms are enumerated starting with 1 for the first term in the document and so on.

As exemplified in Figure 1, $V_t = \{3, 8\}$ represents the fact that the two instances of the term t in the document D are located in the third and the eighth position of the document body.

Figure 1: Distribution of the term t in document D, represented by a rectangular function.

The cardinality $|V_t|$ of $V_t$ is the total number of occurrences of t in the document. The characteristic function

$$f^{(t)}(x) = \begin{cases} 1 & \text{for } x \in [p-1, p] \text{ if } p \in V_t \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

is assigned to $V_t$ for $x \in [0, L]$. The Fourier coefficients of $f^{(t)}$ are given by integrals of $f^{(t)}$ against the corresponding basis functions over $[0, L]$, which reduce to sums over the positions in $V_t$.

Figure 2 shows the Fourier representation of the step function $f^{(t)}$ for the positions $V_t = \{3, 8\}$ of the term t in document D, calculated for different Fourier orders n = 2, 4, 6, 8.

Figure 2: Fourier distribution of $V_t = \{3, 8\}$ in document D, using different Fourier orders.
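
The coefficients of the rectangular term function can be computed in closed form, with one sine and one cosine difference per occurrence and per order. The following minimal sketch assumes an orthonormal basis on [0, L], namely the constant 1/sqrt(L) together with sqrt(2/L)·cos(2πkx/L) and sqrt(2/L)·sin(2πkx/L). This normalization is an assumption made here because it turns the overlap of two functions into the plain scalar product of their coefficient vectors; the normalization used in the original equations may differ. The function name `fourier_coefficients` is hypothetical.

```python
import math


def fourier_coefficients(positions, length, order):
    """Return [a0, a1, b1, ..., an, bn] for a rectangular term function.

    positions: 1-based term positions; an occurrence at p covers [p-1, p].
    length:    number of terms L in the document.
    order:     Fourier order n, giving 2n + 1 coefficients.
    The orthonormal basis on [0, L] is an assumption of this sketch.
    """
    # Constant component: integral of the function times 1/sqrt(L).
    coeffs = [len(positions) / math.sqrt(length)]
    for k in range(1, order + 1):
        w = 2.0 * math.pi * k / length          # angular frequency of order k
        scale = math.sqrt(2.0 / length) / w     # basis norm times 1/w from integration
        a_k = sum(math.sin(w * p) - math.sin(w * (p - 1)) for p in positions)
        b_k = sum(math.cos(w * (p - 1)) - math.cos(w * p) for p in positions)
        coeffs.extend([scale * a_k, scale * b_k])
    return coeffs


if __name__ == "__main__":
    # The sample document: two occurrences at positions 3 and 8, L = 10.
    for n in (2, 4, 6, 8):
        print(n, [round(c, 4) for c in fourier_coefficients([3, 8], 10, n)])
```

The cost is proportional to the number of occurrences times the Fourier order. For the sample positions {3, 8} in a document of length 10, the two occurrences lie exactly half a document apart, so all odd-order coefficients cancel.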



4. COMPARING TERM DISTRIBUTIONS 

The underlying concepts of the proposed approach are: 

• The positions of content terms in a document influence 
its relevance evaluation in the retrieval process. 

• If two content term distributions are similar, then the corresponding terms are located in a similar document region, implying some semantic relationship between them [20], [3], [36].

• The algorithm to compare two term distributions has 
to be computationally simple such that it can be per- 
formed under realistic conditions. 



We argue that finite order Fourier approximations provide a systematic way to characterize and analyze the positions of terms. Applying a Fourier approximation of order n reduces the data necessary to describe the term distribution to 2n+1 real numbers.

In addition, the finite approximation makes it possible to exploit the broadening effect on the original function (Figures 2 and 3), defining a certain neighborhood around each term position. This broadening effect provides an instrument for estimating the similarity between terms within a document.




Figure 3: The broadening of the approximated term distributions, defining the term neighborhoods $N_a$ and $N_b$ and the corresponding overlapping region.



4.1 Comparing the Term Distribution Functions

In this section, the notion of similarity of two term distributions is defined. For a term distribution $f(x)$, the n-th order Fourier approximation $f_n(x)$ is considered and its Fourier coefficients are used to form the $2n + 1$ dimensional real vector $\vec{f}_n = (a_0, a_1, b_1, \ldots, a_n, b_n)$.

The similarity of two term distributions can be related to the overlap integral

$$\langle f, f' \rangle = \int_0^L f(x)\, f'(x)\, dx \qquad (6)$$

The overlap integral measures in which regions of the integration range both functions are large (see Figure 3). An important property of the Fourier expansion (1) is that the overlap integral can be easily expressed by the spectral vectors $\vec{f}_n$ and $\vec{f}'_n$:

$$\langle f_n, f'_n \rangle = a_0 a'_0 + \sum_{k=1}^{n} (a_k a'_k + b_k b'_k) = \vec{f}_n \cdot \vec{f}'_n \qquad (7)$$

i.e. the overlap integral is just the scalar product of the spectral vectors [34]. Since the functions $f$ and $f'$ can represent terms from documents of different lengths, the overlap integral (6) is not used directly to define the similarity of term distributions, but instead the overlap of the normalized term distributions $f_n / \sqrt{\langle f_n, f_n \rangle}$ is used. It is simply the cosine of the angle between the spectral vectors:

$$\mathrm{sim}(f, f') = \frac{\vec{f}_n \cdot \vec{f}'_n}{\|\vec{f}_n\| \, \|\vec{f}'_n\|}$$

Here, the length of the spectral vector is given by

$$\|\vec{f}_n\| = \sqrt{\vec{f}_n \cdot \vec{f}_n} = \sqrt{a_0^2 + \sum_{k=1}^{n} (a_k^2 + b_k^2)}$$
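
The comparison of two spectral vectors described in this section amounts to a scalar product followed by a normalization. A minimal sketch is given below; the function names `overlap` and `similarity` are hypothetical, and returning 0 for an all-zero vector is a convention chosen here, not something the text specifies.

```python
import math


def overlap(u, v):
    """Scalar product of two spectral vectors of equal Fourier order."""
    return sum(x * y for x, y in zip(u, v))


def similarity(u, v):
    """Cosine of the angle between two spectral vectors.

    Returns 0.0 when either vector has zero length (assumed convention).
    """
    norm_u = math.sqrt(overlap(u, u))
    norm_v = math.sqrt(overlap(v, v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return overlap(u, v) / (norm_u * norm_v)
```

Because the result depends only on the angle between the vectors, scaling one distribution (for example, a term that occurs in a longer document) does not change the similarity value.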
4.2 Relevance Ranking Optimization

We state the document ranking problem as an optimization problem that is based on the query term distribution function $f_{q,d}$ and a user defined objective function $f_o$ representing the optimal query term distribution in the document body:

$$\underset{f_{q,d} \in A}{\mathrm{Maximize}} \; \{ \mathrm{sim}(f_{q,d}, f_o) \} \qquad (9)$$

where A represents the set of query term distributions in an initial document ranking, $f_{q,d}$ is the query term distribution function for query q in document d, and $f_o$ is a user defined objective function, representing the optimal query term distributions for the documents in the ranking.

For queries consisting of multiple terms, the distribution 
function is the sum of the single query term distributions. 

Applying expression (9), a new sorted set of documents with a maximum similarity between each document distribution $f_{q,d}$ and the objective function $f_o$ is obtained. In other words, we get a new ranking in which the searched terms are distributed similarly to the optimal query term distribution described by $f_o$.

Figure 4 illustrates several basic objective functions to identify documents where query terms are distributed in particular document regions. The following nomenclature is used to define an objective function:

Definition 1. The expression "$f_o : X|Y$" represents an objective function to evaluate the relevance of documents with respect to the position of specific terms. Each document is divided into Y equally sized sections of length y. The terms situated in the X-th section increase the document's relevance in the ranking.

For example, the objective function $f_o : 1|1$ can be used to search for documents in which content terms (keywords) are distributed within the whole document body. It makes it possible to identify so-called topical documents [21], where multiple keyword instances (topical terms) represent the intensity with which a concept is treated within the document.

More sophisticated objective functions, such as $f_o : 1|2$ and $f_o : 1|3 + 3|3$, can be used if the user is interested in documents where the contents of the first, or the first and the last section are more relevant. An example is the search for scientific papers where the abstract, the introduction (first sections) and the conclusion (last section) typically contain the most condensed document information. Another example might be a newspaper article, where readers expect to find the most relevant information at the top of the document.

Figure 4: Examples of objective functions (panels: topical documents, scientific papers, newspaper articles).

Comparing the term distribution of our sample document D (Figure 1) to Figure 4, it can be observed that D will be only considered as relevant if the applied objective function resembles the pattern $f_o : 1|3 + 3|3$.
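
The ranking by objective function can be sketched as follows. The sketch makes several assumptions that are not taken from the text: term positions are rescaled to the unit interval so that documents of different lengths can be compared with a single objective function, an orthonormal Fourier basis on [0, 1] is used, and an objective function $f_o : X|Y$ is modeled as the indicator of the X-th of Y equal sections. All function names are hypothetical.

```python
import math


def interval_coefficients(intervals, order):
    """Fourier coefficients [a0, a1, b1, ...] of a sum of indicator
    functions of the given (lo, hi) intervals inside [0, 1]."""
    coeffs = [sum(hi - lo for lo, hi in intervals)]
    for k in range(1, order + 1):
        w = 2.0 * math.pi * k
        scale = math.sqrt(2.0) / w
        a_k = sum(math.sin(w * hi) - math.sin(w * lo) for lo, hi in intervals)
        b_k = sum(math.cos(w * lo) - math.cos(w * hi) for lo, hi in intervals)
        coeffs.extend([scale * a_k, scale * b_k])
    return coeffs


def objective(sections, order):
    """sections: list of (X, Y) pairs, e.g. [(1, 3), (3, 3)] for 1|3 + 3|3."""
    return interval_coefficients([((x - 1) / y, x / y) for x, y in sections], order)


def term_distribution(positions, length, order):
    """Rectangular term function with positions rescaled to [0, 1]."""
    return interval_coefficients(
        [((p - 1) / length, p / length) for p in positions], order)


def similarity(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def rank(documents, sections, order=3):
    """documents: doc_id -> (query term positions, document length).
    Returns (similarity, doc_id) pairs, best match first."""
    target = objective(sections, order)
    scored = [(similarity(term_distribution(pos, length, order), target), doc_id)
              for doc_id, (pos, length) in documents.items()]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    docs = {"head": ([1, 2, 3], 30), "tail": ([28, 29, 30], 30),
            "spread": ([5, 15, 25], 30)}
    print(rank(docs, [(1, 3)]))
    print(rank(docs, [(3, 3)]))
```

With the toy collection above, a document whose query terms sit in the first third is ranked first under 1|3, and the document whose terms sit in the last third is ranked first under 3|3.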

4.2.1 Algorithmic Complexity and Index Representation



Table 1: Similarity and ranking for the query "brasil" and three arbitrary TREC documents using the objective functions $f_o : 1|2$ and $f_o : 1|1$.

document       f_o : 1|2          f_o : 1|1
               Sim      rank      Sim      rank
FT944-15312    0.9314   1         0.6067   2
FBIS3-10730    0.5950   2         0.6053   3
FT931-11717    0.5277   3         0.6594   1




Each term distribution function (i.e. its Fourier coefficients) can be obtained using an algorithm with a complexity of $O(\eta)$, where $\eta = \text{termFrequency} \times \text{fourierOrder}$, and it will typically be executed at indexing time.

The most efficient index structure for text query evaluation is the inverted file: a collection of lists (one per term) recording the identifiers of the documents containing that term [4]. An inverted file index consists of two main components: a vocabulary and a set of inverted lists. The inverted lists are represented as sequences of $\langle d, v_{d,t} \rangle$ pairs, where $v_{d,t}$ is the frequency of term t in document d. This is the standard document-level index in which term positions within documents are not recorded. In the proposed approach, this index is augmented with Fourier coefficients:

$$\langle d, a_0^{(t)}, a_1^{(t)}, b_1^{(t)}, \ldots, a_n^{(t)}, b_n^{(t)} \rangle \qquad (10)$$

where n is a predefined Fourier order and $a_k^{(t)}$, $b_k^{(t)}$ are the Fourier coefficients representing the positions of term t in document d. Note that from (3), the component $a_0^{(t)}$ corresponds to the term frequency $v_{d,t}$.

The Fourier coefficients are computed by the indexing pro- 
cess. It should be emphasized that at query time these coef- 
ficients will be used to evaluate the similarity score between 
terms, by applying a simple scalar product calculation. We 
call this method Fourier Vector Scoring (FVS). 
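
An inverted list augmented with Fourier coefficients can be sketched as a mapping from each term to a list of postings. The sketch below is illustrative only: the `Posting` class, the `build_index` function and the orthonormal basis on [0, L] are assumptions, not the index layout of the actual implementation.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Posting:
    doc_id: str
    coeffs: list  # [a0, a1, b1, ..., an, bn] for one term in one document


def fourier_coefficients(positions, length, order):
    """Coefficients of the rectangular term function (assumed orthonormal basis)."""
    coeffs = [len(positions) / math.sqrt(length)]
    for k in range(1, order + 1):
        w = 2.0 * math.pi * k / length
        scale = math.sqrt(2.0 / length) / w
        a_k = sum(math.sin(w * p) - math.sin(w * (p - 1)) for p in positions)
        b_k = sum(math.cos(w * (p - 1)) - math.cos(w * p) for p in positions)
        coeffs.extend([scale * a_k, scale * b_k])
    return coeffs


def build_index(documents, order=3):
    """documents: doc_id -> list of tokens. Returns term -> list of Posting."""
    index = defaultdict(list)
    for doc_id, tokens in documents.items():
        positions = defaultdict(list)
        for pos, term in enumerate(tokens, start=1):  # positions start at 1
            positions[term].append(pos)
        for term, pos_list in positions.items():
            coeffs = fourier_coefficients(pos_list, len(tokens), order)
            index[term].append(Posting(doc_id, coeffs))
    return index
```

Under the normalization assumed here, the first stored component equals the term frequency divided by the square root of the document length, so the term frequency remains recoverable from the posting.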

4.2.2 An Example 

Let us consider three arbitrary documents from the TREC-8 
document collection containing the term "brasil". The corre- 
sponding term distribution functions will now be compared 
with different objective functions, simulating two particular 
ranking criteria. 

In Table 1, the similarity for each document using the Fourier order n = 3 is shown. The applied objective function directly influences the ranking configuration, obtaining the documents FT944-15312 and FT931-11717 with the higher similarity (relevance) values for $f_o : 1|2$ and $f_o : 1|1$, respectively.

Figure 5 indicates how documents whose term distribution approximates the applied objective function obtain a higher similarity value. For example, document FT944-15312 with $f_o : 1|2$ obtains a similarity value of 0.9314, while the same document evaluated with $f_o : 1|1$ has a similarity value of 0.6067, lowering its relevance in the ranking.

4.3 Query Expansion 

Query expansion (or term expansion) is a process of supplementing the original query (q) with additional terms, with the aim of improving retrieval performance [12], [6]. The use of query expansion strategies such as automatic local analysis typically has positive effects on the retrieval performance. Based on this observation, a new approach for query expansion is proposed, considering the top-r documents $D = \{d_1, d_2, \ldots, d_r\}$ of an initial ranking process.

Figure 5: The distribution of the term "brasil" in three TREC documents, applying the objective functions $f_o : 1|2$ (left) and $f_o : 1|1$ (right).

The function $f_{q,d}$ represents the distribution of the query term q for each document $d \in D$. The set of terms $T_q$ whose elements t maximize the expression $\mathrm{sim}(f_{q,d}, f_{t,d})$ is computed. Using this expression, the terms for all documents in D that have a similar distribution as the query, i.e. terms positioned near the query in the top ranked documents, are obtained.

Taking a look at the term positions of a typical TREC-8 document (see Figure 6), it can be observed how the similarity criterion reflects the location properties of distant and neighboring terms (see Figure 7). The term "brasil" and its neighbor term "portuguese" have a high similarity value of 0.9490, while its similarity value with respect to the more distant term "chile" decreases to 0.0533, which is about 18 times smaller. Thus, the proposed method is quite sensitive with respect to the location properties of terms.

The expanded query is the set

$$T_q = \{\tau_1, \tau_2, \ldots, \tau_k\} \qquad (11)$$

consisting of the k best related query terms in D, obtained by ranking the terms according to the expression

$$\mathrm{sim}(f_{q,d}, f_{\tau_i,d}), \; \forall d \in D \qquad (12)$$

The maximization process requires a simple comparison using the scalar product and norm of the corresponding Fourier coefficients, i.e. the algorithm to calculate the expanded query terms has a computational complexity of $O(\eta)$, where $\eta = |D| m + m \log m$, and m is the number of terms in each document in D.
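
The selection of expansion terms can be sketched as follows. Two points are assumptions of this sketch rather than statements from the text: the per-document similarities of a candidate term are aggregated by taking their maximum over the top-ranked documents, and an orthonormal Fourier basis on [0, L] is used. The function names are hypothetical. Note also that a Fourier series is periodic, so in this representation the end of a document is adjacent to its beginning.

```python
import heapq
import math
from collections import defaultdict


def fourier_coefficients(positions, length, order):
    coeffs = [len(positions) / math.sqrt(length)]
    for k in range(1, order + 1):
        w = 2.0 * math.pi * k / length
        scale = math.sqrt(2.0 / length) / w
        a_k = sum(math.sin(w * p) - math.sin(w * (p - 1)) for p in positions)
        b_k = sum(math.cos(w * (p - 1)) - math.cos(w * p) for p in positions)
        coeffs.extend([scale * a_k, scale * b_k])
    return coeffs


def similarity(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def expand_query(query, top_documents, k, order=3):
    """Return the k (term, score) pairs most similar to the query term.

    top_documents: list of token lists (the top-r documents).
    Scores are the maximum similarity over documents (assumed aggregation).
    """
    best = {}
    for tokens in top_documents:
        positions = defaultdict(list)
        for pos, term in enumerate(tokens, start=1):
            positions[term].append(pos)
        if query not in positions:
            continue
        f_q = fourier_coefficients(positions[query], len(tokens), order)
        for term, pos_list in positions.items():
            if term == query:
                continue
            f_t = fourier_coefficients(pos_list, len(tokens), order)
            best[term] = max(similarity(f_q, f_t), best.get(term, -1.0))
    return heapq.nlargest(k, best.items(), key=lambda item: item[1])


if __name__ == "__main__":
    doc = ["agenda", "fever", "visit", "brasil", "portuguese",
           "brasil", "report", "president", "franco", "chile"]
    print(expand_query("brasil", [doc], k=3))
```

In the toy document, the term placed between the two query occurrences receives the highest score, while a term several positions away scores much lower.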

5. EXPERIMENTAL RESULTS 



<DOC> 

<DOCNO> FBIS3-10730 </DOCNO>
<HT> "drlat048_n_94005" </HT> 
<HEADER> 

<AU> FBIS-LAT-94-048 </AU> 
Document Type:Daily Report
<DATE1> 11 Mar 1994 </DATE1>
</HEADER> 

<F P=100> Chile </F> 

<H3><TI> Brazil's Franco Completes Schedule 

Despite Flu </TI></H3> 

<F P=102> PY1103004294 Brasilia Voz 

do Brasil Network in Portuguese 

2200 GMT 10 Mar 94 </F> 

<F P=103> PY1103004294 </F> 

<F P=104> Brasilia Voz do Brasil Network </F> 
<TEXT> 

Language: <F P=105> Portuguese </F> 
Article Type:BFN

[Text] Although he has the flu and a fever of 38 degrees 
centigrade, President Itamar Franco is carrying out all 
commitments included on the agenda of his visit to Chile. 
</TEXT> 
</DOC> 



Figure 6: A typical TREC-8 document. 
Figure 7: Term neighborhood analysis for a typical
TREC-8 document.

The TREC-8 document collection has been used to mea- 
sure the performance of the proposed approach. The goal 
of this evaluation is to determine how well the algorithm is 
able to identify documents based on a predetermined objec- 
tive function, and to compare the proposed query expansion 
approach with some of the state-of-the-art models. 

The evaluation framework consists of the following compo- 
nents: (a) the Ad hoc Test Collection containing 556,077 
documents (2.09 Gigabytes) corresponding to the Tipster 
disks (3 and 4), (b) the Topics and Relevance Judgments 
(qrels), (c) our approach consisting of 4 Java modules for in- 
dexing, search, graphical evaluation and configuration tasks, 
and (d) the results analysis where the effectiveness of our 
approach will be estimated. 

5.1 Objective Function Runs 

In this experiment, it will be analyzed how the query terms (TREC topics) are distributed in the top-10 ranked documents for three different ranking schemes: (a) tfidf (baseline) and two objective functions: (b) $f_o : 1|3$ and (c) $f_o : 3|3$.

Figure 8: Analyzing the query term distribution skewness for two different ranking schemes: $f_o : 1|3$ and $f_o : 3|3$.

To measure how the developed ranking algorithm follows the 
proposed objective functions, the skewness ^7] of the term 
position distributions is calculated and their asymmetry is 
compared with the tfidf scheme. To obtain relevant statisti- 
cal results, only topics that return more than 10 document 
hits were considered. 
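
The asymmetry measure can be illustrated with a short sketch. The text does not state which skewness estimator was used; the sketch assumes the moment coefficient of skewness (third central moment divided by the 1.5th power of the second) applied to positions rescaled to [0, 1]. The function names are hypothetical.

```python
def relative_positions(positions, length):
    """Map 1-based term positions to interval midpoints in [0, 1]."""
    return [(p - 0.5) / length for p in positions]


def skewness(values):
    """Moment coefficient of skewness (assumed estimator).

    Positive: mass concentrated at low values with a tail to the right,
    i.e. query terms mostly near the top of the document.
    Negative: query terms mostly near the bottom.
    """
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0.0:
        return 0.0
    return m3 / m2 ** 1.5
```

For example, positions clustered at the start of a 20-term document with a single late occurrence give a positive value, and the mirrored arrangement gives a negative one.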

As depicted in the first graph of Figure 8 (skewness), the query term distribution in the tfidf ranking has a skewness of around zero, i.e. the terms are evenly distributed. Applying both objective functions, it can be observed how in the optimized ranking the query terms approximate the corresponding objective function: the ranking based on $f_o : 1|3$ shows a positive skewness, demonstrating that terms are mainly situated in the header of the ranked documents. On the other hand, applying $f_o : 3|3$ generates a document ranking where query terms are predominantly distributed at the document's bottom (negative skewness).

The last two graphs of Figure 8 show the rate of query terms fitting the proposed objective functions. For example, the $f_o : 1|3$ function applied to topic 420 produces a document ranking where 68% of the query terms are situated inside the objective function region, while the tfidf ranking returns only a fitting rate of 26%. Analyzing all experimental results, it can be observed that by applying the proposed approach to TREC-8, about 67% of the query terms (from the top ranked documents) are positioned inside the defined objective function region. Therefore, it is evident that the ranking process can be flexibly optimized, providing new possibilities to express the information need of the user.
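
The fitting rate discussed above can be read as the fraction of query term occurrences that fall inside the sections selected by the objective function. A minimal sketch under that reading is given below; the exact definition used in the experiments is not stated in the text, and the function names are hypothetical.

```python
def section_of(position, length, sections):
    """1-based index of the section (out of `sections` equal parts)
    that contains the given 1-based term position."""
    return (position - 1) * sections // length + 1


def fitting_rate(documents, target_sections, sections):
    """Fraction of query term occurrences inside the target sections.

    documents:       list of (query term positions, document length) pairs.
    target_sections: set of section indices X selected by f_o : X|Y.
    sections:        number of sections Y.
    """
    inside = 0
    total = 0
    for positions, length in documents:
        for p in positions:
            total += 1
            if section_of(p, length, sections) in target_sections:
                inside += 1
    return inside / total if total else 0.0
```

For the objective function 1|3 + 3|3, the call would use `target_sections={1, 3}` and `sections=3`.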



5.2 Query Expansion Runs 

Our query expansion experiments are based exclusively on 
the search results. No external knowledge structure was used 
to leverage the re-ranking procedure. 

In the group of runs described in the following, the proposed 
query expansion model based on the query terms distribu- 
tion (fq) is evaluated. 

Using the top-n ranked documents, the query distribution function $f_q$ for each ranked document is obtained, and the terms having a similar distribution as $f_q$ are calculated. Based on equation (12), the first k candidate terms $T_q = \{\tau_1, \tau_2, \ldots, \tau_k\}$ for query reformulation are obtained and the new ranking using our test collection is evaluated.



The query reformulation and ranking procedure consists of the following steps:

1. Calculate the expanded query terms $T_q$ based on the top-n documents from the tfidf ranking.

2. Using $T_q$, calculate the expanded query

$$q_e = \{w_0 q, w_1 \tau_1, w_2 \tau_2, \ldots, w_k \tau_k\} \qquad (13)$$

where $w_i$ is a weighting factor corresponding to the similarity between the original query q and the term $\tau_i$.

3. Perform the tfidf-search with $q_e$.
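
The steps above can be sketched as follows. The scoring function is a toy tf-idf stand-in written for illustration (it is not the scoring of Terrier or of the authors' Java modules), the weight of the original query is assumed to be 1, and all names are hypothetical.

```python
import math
from collections import Counter


def build_expanded_query(query, expansion, unit_weights=False):
    """expansion: list of (term, similarity) pairs, best first.

    Returns term -> weight. With unit_weights=True every weight is 1,
    which corresponds to the setting w_i = 1.
    """
    weighted = {query: 1.0}  # weight of the original query: assumed 1
    for term, sim in expansion:
        weighted[term] = 1.0 if unit_weights else sim
    return weighted


def tfidf_scores(weighted_query, documents):
    """Toy weighted tf-idf score per document (illustrative formula).

    documents: doc_id -> list of tokens.
    """
    n = len(documents)
    counts = {doc_id: Counter(tokens) for doc_id, tokens in documents.items()}
    scores = {}
    for doc_id, tf in counts.items():
        score = 0.0
        for term, weight in weighted_query.items():
            df = sum(1 for c in counts.values() if term in c)
            if df:
                score += weight * tf[term] * math.log(1.0 + n / df)
        scores[doc_id] = score
    return scores


if __name__ == "__main__":
    docs = {"d1": ["osteoporosis", "bone", "calcium"],
            "d2": ["osteoporosis", "weather"]}
    q_e = build_expanded_query("osteoporosis", [("bone", 0.9), ("calcium", 0.8)])
    print(tfidf_scores(q_e, docs))
```

In the toy collection, both documents score the same for the original single-term query, and the expanded query separates them in favor of the document that also contains the expansion terms.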



Using the top-10 ranked documents and the first 40 terms having the highest query similarity, the proposed Fourier Vector Scoring (FVS) query expansion method for $w_i = 1$ is compared with eight state-of-the-art query expansion methods: Rocchio for $\beta$ = 0.2, 0.4, 0.6, 0.8, 1.0 (R0.2, R0.4, R0.6, R0.8, R1.0) [31], Bose-Einstein 1 (Bo1), Bose-Einstein 2 (Bo2) [1] and Kullback-Leibler (KL) [10]. For the query expansion experiments, the Terrier [28] software was used.



Considering the measures of relevance precision and precision at 10 documents, it can be observed from Figure 9 that FVS outperforms all other query expansion methods.
Table 2: Examples of query expansion terms for some arbitrary TREC-8 Topics.

Topic   Title Query                 Terms for Query Expansion
403     osteoporosis                bone, women, calcium, health, risk, study, claim, research
406     Parkinson's disease         brain, research, cells, london, drug, symptoms, alzheimer, fetal
408     tropical storms             July, disaster, area, Caribbean, hurricane, texas, georgia, temperatures
417     creativity                  people, mental, illness, scientists, part, human, children, depression
421     industrial waste disposal   management, facilities, hazardous, radioactive, solid, company, state, site
427     UV damage, eyes             radiation, rays, sunglasses, protect, adhesive, patch, exposure, children
429     Legionnaires' disease       nosocomial, hyph, infection, control, patients, prevention, pneumonia
431     robotic technology          robot, manufacturing, industrial, system, company, human, industry



Figure 9: Ranking improvements using query expansion (horizontal axis: query expansion approaches R0.2, R0.4, R0.6, R0.8, R1.0, FVS, KL, Bo1, Bo2).

Table 2 shows the most relevant expanded terms, listed in descending relevance order, for eight arbitrary topics from the test collection. The same term sets were also used in the query expansion runs.

6. CONCLUSIONS

In this paper, term distribution analysis using Fourier series expansion has been proposed as a novel methodology to improve document relevance evaluation in information retrieval applications. The proposed approach is based on a Fourier series representation of the term positions in a document collection, by calculating the corresponding expansion coefficients. By using query objective functions for predetermined document regions, the approach provides new ways to define or refine queries. Furthermore, a novel query expansion methodology has been presented to support the user in the query refinement process.

An evaluation of our proposal using the TREC-8 collection has demonstrated that 67% of the query terms are positioned inside the user defined objective function region. A further analysis has shown that using the proposed approach to generate expanded query terms leads to a performance gain over state-of-the-art query expansion models such as Rocchio and Divergence from Randomness models.

There are several issues for future work. For example, it would be interesting to study the possibility of generating optimized objective functions by training our approach with particular document categories such as medical, juridical, scientific papers, etc. A further topic to be investigated is the compression of the index information, because the size required to save the spectral information of low frequency terms currently exceeds the size of the positional information of terms. Finally, the presented approach is not limited to the Fourier expansion, but can be generalized to other kinds of series to represent the term distribution functions.